Modelling Mail-In Votes In the 2020 US Election

Author

Other Contributors

Reviewers

Introduction

The Covid-19 pandemic led to record numbers of mail-in votes in the 2020 United States Presidential Election. Because of the high volume of mail ballots, plus rules that prevented some states from counting these ballots before election day, the result of the election remained uncertain for several days, with periodic updates coming as ballots were tabulated and reported.

In particular, several states had very close races that had the potential to tip the election in favor of either candidate. Because of the Electoral College system, where states almost universally employ a "winner-take-all" model for allocating their Electoral Votes, a few states can have a large effect on the outcome of the election. For example, in 2000 the election came down to a margin of around 500 votes in Florida (out of roughly 6 million ballots cast), despite the fact that Al Gore won the national popular vote. In 2020, a few states with very close races dominated the headlines for the week after the election, of which we will look at Pennsylvania, Arizona, and Georgia in this post. The final outcome of the election hung on the results from these states, and the slow drip feed of additional ballots being released left audiences constantly checking the news for updates.

In this Turing Data Story, I examine a few ways to analyze the data updates coming from each state to predict the final outcome. This originated from some Slack discussions with Camila Rangel Smith and Martin O'Reilly, whom I list above as contributors for this reason. In particular, our initial interest in this question centered around the uncertainties in the analysis done by Camila and Martin. I address these here by using Bayesian inference to quantify the uncertainty and to determine when we might reasonably have called each state for the eventual winner based on the updated data.

Data

To create the models in this post, I use the NYT Election Data Scraper, which is updated every few minutes, to obtain the latest results. To load this data into a Python session for analysis, I can use Pandas to read the CSV version of the data directly from the URL and extract the state that I wish to examine:

Note that rather than specifying the leading and trailing candidates, I instead just convert the vote differential into a margin that is positive if Biden is leading and negative if Trump is leading. I also add columns for the total number of votes for Biden and Trump, which I will use later.
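As a rough illustration of this conversion, the sketch below works on a small hand-made table rather than the live CSV. The column names (`leading_candidate_name`, `leading_candidate_votes`, `trailing_candidate_votes`, `vote_differential`, `votes_remaining`) and the helper `add_signed_margin` are assumptions made for the example, so they may need adjusting to match the actual file.

```python
import pandas as pd

def add_signed_margin(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a signed margin and per-candidate vote totals.

    Assumes (hypothetically) that each row names the current leader and gives
    the leader's votes, the trailer's votes and an unsigned vote differential.
    """
    out = data.copy()
    biden_leads = out["leading_candidate_name"] == "Biden"
    # Positive margin when Biden leads, negative when Trump leads.
    out["margin"] = out["vote_differential"].where(biden_leads, -out["vote_differential"])
    out["biden_votes"] = out["leading_candidate_votes"].where(
        biden_leads, out["trailing_candidate_votes"]
    )
    out["trump_votes"] = out["trailing_candidate_votes"].where(
        biden_leads, out["leading_candidate_votes"]
    )
    return out

# Made-up rows, purely to exercise the function.
example = pd.DataFrame({
    "state": ["Georgia", "Georgia", "Arizona"],
    "leading_candidate_name": ["Trump", "Biden", "Biden"],
    "leading_candidate_votes": [1000, 1200, 900],
    "trailing_candidate_votes": [950, 1150, 880],
    "vote_differential": [50, 50, 20],
    "votes_remaining": [400, 100, 300],
})
georgia = add_signed_margin(example[example["state"] == "Georgia"])
print(georgia[["margin", "biden_votes", "trump_votes", "votes_remaining"]])
```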

For instance, if I would like to see the data for Georgia:

The data contains a timestamp, the number of votes for each candidate, the margin, and an estimate of the number of votes remaining. This allows us to see how the vote margin evolves over time as new ballots are counted. For example, we can look at the data for all states up to midnight on 5 November to see the evolution of the race:

Note that the trend shows that Biden is catching up as more votes are counted in both Georgia and Pennsylvania, while Trump is catching up in Arizona. The trend is fairly linear. Thus, one might first consider doing a simple regression to estimate the final margin.

Linear Regression Analysis

We can do a simple analysis based on this. A linear regression of the margin against the number of votes remaining has two fitted parameters: the slope, which is related to the fraction of the outstanding votes that are for Biden, and the intercept, which indicates the final margin when there are no votes remaining. (This is the initial analysis that was done by Camila for Pennsylvania and Martin for Arizona.)
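A minimal sketch of such a fit, using `numpy.polyfit` on made-up numbers rather than the real updates, might look like the following. It assumes only two candidates receive votes, so that each counted ballot moves the margin by $2p - 1$ on average, where $p$ is the Biden share of the batch.

```python
import numpy as np

# Synthetic updates: margin (Biden minus Trump) against votes remaining.
votes_remaining = np.array([500_000, 400_000, 300_000, 200_000, 100_000])
margin = np.array([-120_000, -90_000, -60_000, -30_000, 0])

# Fit margin = slope * votes_remaining + intercept.
slope, intercept = np.polyfit(votes_remaining, margin, deg=1)

# The intercept is the projected margin once no votes remain.
# The margin rises by (2p - 1) for every vote removed from the remaining pool,
# so slope = -(2p - 1) under the two-candidate assumption.
biden_fraction = (1 - slope) / 2

print(f"projected final margin: {intercept:.0f}")
print(f"implied Biden share of outstanding votes: {biden_fraction:.2f}")
```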

Note that at this point, the linear regression predicts margins in Pennsylvania and Arizona that are quite different from the final margins. Georgia appears to be very close to the final margin. However, Arizona seems to have outlier points that muddle this analysis (which was first noted by Martin). Thus, while these models are useful starting points, they do not appear to be particularly robust and are somewhat dependent on when you choose to fit the data.

However, even though the trends point clearly in favor of Biden in this analysis, we do not have a good idea of the uncertainties. Without this, one cannot comfortably call a state in favor of one candidate. This is why the media waited several days beyond the election to call the states for Biden: without an estimate of the uncertainty, the only way to be sure is to wait for the margin to surpass the number of remaining votes. How might we develop a model that explicitly captures this uncertainty? And given such a model, when can we be confident in the result, and how does it align with the narrative from the news media? The following describes one approach for doing so.

Modelling Uncertainty in the Votes

To address this shortcoming, we turn to Bayesian inference. Bayesian statisticians think of model parameters not as single numbers, but rather as probability distributions -- in this way, we can get a sense of the range of values that the model thinks are consistent with the data.

Model Structure

As noted above, the regression model has two different parameters: the slope (related to the fraction of votes that are cast for Biden), and the intercept (which is essentially the prediction of the final margin). Note that while our linear regression fit these two things simultaneously, there is no reason why we had to let the final margin be a "free" parameter that we adjusted in the fitting: we could have instead just fit a single parameter for the slope (for instance, simply using the fraction of mail ballots cast for Biden thus far), and then used that estimate to project the votes remaining in order to extrapolate and obtain our estimate of the final margin.
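To make this one-parameter alternative concrete, here is a small sketch with invented numbers. The function name `project_final_margin` and its arguments are hypothetical, and it again assumes that every remaining ballot goes to one of the two candidates.

```python
def project_final_margin(current_margin, votes_remaining,
                         biden_votes_counted, total_votes_counted):
    """Extrapolate the final margin from a single estimated vote probability.

    The vote probability p is the Biden share of the ballots counted in the
    updates so far; each remaining ballot then shifts the margin by 2p - 1
    on average (two-candidate assumption).
    """
    p = biden_votes_counted / total_votes_counted
    return current_margin + votes_remaining * (2 * p - 1)

# Trailing by 30,000 with 200,000 ballots left, and 65% of the ballots
# counted in the updates so far going to Biden.
projected = project_final_margin(-30_000, 200_000, 65_000, 100_000)
print(f"projected final margin: {projected:.0f}")
```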

With this format in mind, we need to develop a way to account for the uncertainty in both of these steps. Our Bayesian model will treat the probability that a vote goes for Biden as a probability distribution (rather than a single number), and as a result the final projected margin will also be a probability distribution. In practice, rather than determining the analytical form of these probability distributions, we will instead model the outcome by drawing samples from the distribution. This is a standard method within Bayesian inference, and illustrates the power of this technique for quantifying uncertainty.

Bayesian Model of the Vote Probability

Bayesian inference tends to think of probability distributions as reflecting statements about our beliefs. Formally, we need to state our initial beliefs before we see any data, and then we can use that data to update our knowledge. This previous belief is known as a prior in Bayesian inference, and the updated beliefs once we look at our data are known as the posterior.

Bayesian Inference

Bayesian Inference involves taking our previous beliefs about a system, described by a probability distribution of reasonable values we expect a particular parameter to take, and then using the data that we have to update our beliefs about the distribution that we expect that parameter to take by computing the posterior. This involves applying Bayes' rule:

$$ p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} $$

Here, $p(\theta)$ is the prior distribution (which we will specify before we look at the data), $p(y|\theta)$ is the likelihood (the probability that we would have gotten the data given a particular choice of $\theta$), and $p(y)$ is known as the evidence (the probability of getting that particular observation, averaged over all values of $\theta$ allowed by the prior). For many problems, it is straightforward to specify a prior and to compute the likelihood, while computing the evidence can be tricky.
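For a single parameter, each term in Bayes' rule can be evaluated on a grid, which may help make the pieces concrete. The sketch below is only an illustration with invented counts (65 Biden votes in a batch of 100) and a flat prior; it is not the model used later in this post.

```python
import numpy as np

# Grid of candidate values for the vote probability theta.
theta = np.linspace(0.001, 0.999, 999)

# Flat prior: every grid value is equally probable.
prior = np.ones_like(theta) / theta.size

# Binomial likelihood of k Biden votes in a batch of n ballots. The binomial
# coefficient is omitted because it cancels when we normalise.
k, n = 65, 100
likelihood = theta**k * (1 - theta) ** (n - k)

# Evidence: the likelihood averaged over the prior (a sum on the grid).
evidence = np.sum(likelihood * prior)

# Bayes' rule.
posterior = likelihood * prior / evidence
posterior_mean = np.sum(theta * posterior)
print(f"posterior mean of theta: {posterior_mean:.3f}")
```

With more than a handful of parameters a grid like this becomes impractically large, which is one reason to sample from the posterior instead.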

In practice, for most models we cannot compute the evidence very easily, so instead of computing the posterior directly, we draw samples from it. A common technique for this is Markov Chain Monte Carlo (MCMC) sampling. A number of software libraries have been written in recent years to make carrying out this sampling straightforward using what are known as probabilistic programming languages. These languages formally treat variables as probability distributions with priors and then draw samples from the posterior to allow us to perform inference.

A Hierarchical Bayesian Model For Voting

In our linear regression model, we effectively treated the vote probability as a single, unchanging value. This is the same as saying that every voter in our model is identical. Given the political polarization in the US, this is probably not a very good assumption. Although the data seems to strongly suggest that the mail-in votes are consistently in favor of one candidate, this is not the same as saying all voters are identical. In the following, we build a model to relax this assumption, using what is known as a hierarchical Bayesian model.

If the above model of assuming that every voter is identical is one extreme, then the other extreme is to assume that every voter is different and we would need to estimate hundreds of thousands to millions of parameters to fit our model. This is not exactly practical, so a hierarchical model posits a middle ground that the vote probability is itself drawn from a probability distribution. Note that this goes a step further than simply making the model Bayesian by treating the vote probability as a probability distribution -- in the model, each incremental update of votes has a single vote probability associated with it, but that probability is drawn from a probability distribution and thus can vary. By quantifying this variability we will be able to estimate the uncertainty in the final outcome.

In the following, we model the vote probability by assuming that each vote update has a single vote probability associated with it, and that vote probability is drawn from a beta distribution. A beta distribution is a distribution defined over the interval $[0,1]$ with two shape parameters $a$ and $b$ that lets us flexibly specify a wide range of outcomes. If $a$ and $b$ are both less than 1, then the distribution is biased towards the extreme values of 0 or 1, while if they are both greater than 1 then the distribution has a single peak away from the extremes (at 0.5 when $a = b$). The mean is $a/(a+b)$, so if $a > b$ the distribution is biased towards 1, while if $b > a$ it is biased towards 0. Thus we can specify a range of distributions with just two parameters.
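These shapes can be checked numerically by drawing from a few beta distributions. The sketch below uses NumPy's random generator; the helper `summarise_beta` and the particular $(a, b)$ pairs are chosen only for illustration.

```python
import numpy as np

def summarise_beta(rng, a, b, size=100_000):
    """Draw from Beta(a, b) and report the mean and the share of extreme draws."""
    draws = rng.beta(a, b, size=size)
    return {
        "mean": float(draws.mean()),
        "share_extreme": float(np.mean((draws < 0.1) | (draws > 0.9))),
    }

rng = np.random.default_rng(2020)
for a, b in [(0.5, 0.5), (5.0, 5.0), (6.0, 3.0), (3.0, 6.0)]:
    s = summarise_beta(rng, a, b)
    print(f"a={a}, b={b}: mean={s['mean']:.2f}, "
          f"share below 0.1 or above 0.9={s['share_extreme']:.2f}")
```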

Thus, instead of estimating the vote probability directly, we need to estimate $a$ and $b$, which will tell us what we expect the distribution of the vote probability to be. Once we have estimates for those parameters, we can forecast the remaining votes by repeatedly drawing the vote probability from the appropriate beta distributions. Having multiple levels like this is why these models are known as hierarchical -- parameters are drawn from distributions whose parameters are themselves described by distributions.

Since all parameters in a Bayesian model must have priors, our task is now to encode our prior beliefs about the vote probability by setting prior distributions for $a$ and $b$.

Prior

Often, in Bayesian inference we don't have strong feelings about what values we might expect for a parameter. In those cases, we often try to use something simple, what is known as an uninformative prior. These might be expressed as a statement like "every value is equally probable". Or in this case we might assume that our prior for the vote probability should be peaked close to 0.5, and then taper off towards 0 and 1, with the argument that US presidential elections are usually decided by a few percentage points difference in the national popular vote. This might seem very reasonable on the surface, as America is pretty evenly divided between Democrats and Republicans.

However, mail-in votes in practice can be extremely biased towards one party. Historically, a large majority of mail-in ballots have been Democratic, for a variety of reasons. Trump also spent much of the campaign sowing doubt about mail-in ballots (telegraphing his post-election strategy of trying to throw them out in court), so his supporters may be much less likely to vote in this manner. However, there could also be a situation where the mail-in ballots fall heavily towards a Republican candidate (as we have seen already, more of the Arizona ballots tend to be in favor of Trump). Thus, based on this I would argue that what we actually want is a prior that is reasonably likely to include some extremes in the vote probability to ensure that our estimate of the final outcome prior to looking at the data doesn't exclude a significant swing.

This issue illustrates a challenge with Bayesian Hierarchical models -- when the parameter that we have some knowledge about is itself described by a distribution, the priors for the distribution parameters can be more difficult to specify. For this reason, modellers often go one level further and specify prior distributions on the parameters used to specify the priors on the model parameters, which are known as hyperpriors, and see how varying the priors changes the outcome of inference. We will not explore this level of Bayesian modelling, but it should suffice to say that I tried a number of different choices for the priors before arriving at something that I thought accurately reflected my prior beliefs about the outcome.

In the priors that I finally settled on, I use a Lognormal distribution for my prior on $a$ and $b$. I choose the parameters of the lognormal distributions for $a$ and $b$ to be slightly different, such that they are more likely to give a Democratic-leaning distribution but still have a decent chance of producing extremes for Trump. I also choose the parameters such that we get a mix of values more biased towards the extremes as well as those biased towards values closer to 0.5. This should accurately reflect our prior uncertainty in the outcome, as we think there is a decent chance based on historical data that the mail votes are heavily in favor of one candidate. Here are some histograms showing single samples of the vote probability drawn from this prior, and an aggregate histogram of 100 samples:
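A sketch of how draws from a prior of this kind could be generated with NumPy is given below. The lognormal parameters (`mu_a=0.4`, `mu_b=0.2`, `sigma=0.5`) are placeholder values picked for illustration, not necessarily the ones used to produce the histograms, and `sample_prior_vote_probability` is a hypothetical helper.

```python
import numpy as np

def sample_prior_vote_probability(rng, n_hyper=100, n_per_hyper=50,
                                  mu_a=0.4, mu_b=0.2, sigma=0.5):
    """Draw (a, b) from lognormal priors, then vote probabilities from Beta(a, b).

    The default mu/sigma values are illustrative assumptions only. Giving a
    slightly larger location parameter than b tilts the draws towards values
    above 0.5.
    """
    a = rng.lognormal(mean=mu_a, sigma=sigma, size=n_hyper)
    b = rng.lognormal(mean=mu_b, sigma=sigma, size=n_hyper)
    # For every (a, b) pair, draw n_per_hyper vote probabilities.
    theta = rng.beta(a[:, None], b[:, None], size=(n_hyper, n_per_hyper))
    return a, b, theta

rng = np.random.default_rng(0)
a, b, theta = sample_prior_vote_probability(rng)
print("share of (a, b) pairs with both below 1:", np.mean((a < 1) & (b < 1)))
print("mean prior vote probability:", theta.mean())
```

Each row of `theta` corresponds to one single-sample histogram, and flattening the whole array gives the aggregate view.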

From these individual samples, as well as the aggregated histogram, we see that we get a range of outcomes, with a slight bias towards those that favor Democrats. As we acquire enough data to reliably estimate the underlying distribution of the vote probabilities, we should see better estimates of the true distribution, which will eliminate more of the extremes and reduce the uncertainty in the final outcome.

Likelihood

Finally, we need to explicitly model the likelihood. When you flip a coin with a fixed probability of heads a number of times, the number of heads follows a binomial distribution. Thus, we can use a binomial likelihood to model the range of vote probabilities that might be consistent with the votes that were cast. This can be computed analytically, and most probabilistic programming languages have built-in capacity for computing likelihoods of this type. This is done by setting this particular variable to have a known (observed) value, which indicates to the probabilistic programming language that this variable is used to compute the likelihood. In our particular case, this likelihood will be a vector over a series of trials (one per vote update), each with a different value of the vote probability.
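Written out by hand, a vectorised binomial log-likelihood of this kind could look like the sketch below. The function name and arguments are hypothetical, and a probabilistic programming language would normally compute this for us.

```python
import math
import numpy as np

def binomial_log_likelihood(biden_votes, batch_sizes, theta):
    """Sum of binomial log-probabilities over vote updates.

    Update i contains batch_sizes[i] ballots, biden_votes[i] of them for Biden,
    and has its own vote probability theta[i], strictly between 0 and 1.
    """
    k = np.asarray(biden_votes, dtype=float)
    n = np.asarray(batch_sizes, dtype=float)
    p = np.asarray(theta, dtype=float)
    # Log of the binomial coefficient "n choose k", via log-gamma for stability.
    log_coef = np.array([
        math.lgamma(ni + 1) - math.lgamma(ki + 1) - math.lgamma(ni - ki + 1)
        for ni, ki in zip(n, k)
    ])
    return float(np.sum(log_coef + k * np.log(p) + (n - k) * np.log1p(-p)))

# Two made-up updates: 650 of 1000 and 280 of 400 ballots for Biden.
close = binomial_log_likelihood([650, 280], [1000, 400], [0.65, 0.70])
far = binomial_log_likelihood([650, 280], [1000, 400], [0.50, 0.50])
print(close, far)  # the first set of vote probabilities fits the data better
```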

PyMC3 Implementation

Thus, we can now write down a model in a probabilistic programming language in order to draw samples from the posterior. There are a number of popular languages for this -- here I use PyMC3 to implement my model. PyMC3 can easily handle all of the features we specified above (hierarchical structure, and a vector representation of the binomial likelihood), which are written out in the function below:

Once I draw MCMC samples for $a$ and $b$, I convert those samples into samples of $\theta$ to see our posterior estimate of the vote probability.

Looking at these plots, we see that the model is now much more varied in its estimates for the vote probability (note that this is the posterior for the distribution expected for the vote probability, rather than the explicit values of the vote probability itself). The mean is still where we expected it from the linear regression analysis, but the distribution is much wider due to the fact that occasionally votes come in from places that are not as heavily in favor of Biden (or Trump in the case of Arizona). This should considerably increase the spread of the predicted final margin and ensure that it is not overconfident in the final result.

Predicting the Final Margin

Once we have samples of the vote probability, we need to simulate the remaining votes to predict the final outcome. This is known as estimating the posterior predictive distribution in Bayesian inference, as we use our updated knowledge about our model parameters to predict data that we have not yet observed (here, the ballots still to be counted).

What is a reasonable way to simulate the remaining votes? As we see from the data, the votes come in a steady drip feed as ballots are counted. Thus, we can simulate this by sampling randomly, with replacement, from the data for the number of ballots cast in each update until we get to the number of votes remaining. We can then use our posterior samples of $a$ and $b$ to generate the distribution of vote probabilities, and then draw from the vote probabilities to forecast the outcome of each batch of votes using a binomial distribution. We repeat this process 10 times to ensure that the result isn't dependent on the particular realization of the drip feed simulation, and aggregate those samples to get the final estimate of the posterior predictive distribution. This should give a reasonable estimate of the final outcome based on our model.
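The procedure just described can be sketched with NumPy as follows. This is an illustrative reconstruction under a two-candidate assumption, with hypothetical names and made-up inputs, and it truncates the last simulated batch so that the total matches the votes remaining.

```python
import numpy as np

def simulate_final_margin(rng, current_margin, votes_remaining, batch_sizes,
                          a_samples, b_samples, n_repeats=10):
    """Simulate the final margin (Biden minus Trump) for each posterior sample."""
    sizes = np.asarray(batch_sizes, dtype=np.int64)
    sizes = sizes[sizes > 0]
    if sizes.size == 0:
        raise ValueError("need at least one positive batch size")
    a = np.asarray(a_samples, dtype=float)[:, None]
    b = np.asarray(b_samples, dtype=float)[:, None]

    results = []
    for _ in range(n_repeats):
        # One simulated drip feed: resample observed batch sizes with
        # replacement until the remaining votes are used up.
        batches, left = [], int(votes_remaining)
        while left > 0:
            size = min(int(rng.choice(sizes)), left)
            batches.append(size)
            left -= size
        batches = np.array(batches, dtype=np.int64)

        # One vote probability per (posterior sample, batch), then the
        # number of Biden votes in each batch.
        theta = rng.beta(a, b, size=(a.shape[0], batches.size))
        biden = rng.binomial(batches, theta)
        # Two-candidate assumption: Trump receives the rest of each batch.
        results.append(current_margin + (2 * biden - batches).sum(axis=1))
    return np.concatenate(results)

rng = np.random.default_rng(0)
margins = simulate_final_margin(
    rng, current_margin=-5_000, votes_remaining=50_000,
    batch_sizes=[2_000, 5_000, 1_000],
    a_samples=[6.0, 5.0, 7.0], b_samples=[3.0, 3.0, 4.0],
)
print(margins.mean(), margins.std())
```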

As we can see from this, the model has fairly wide intervals surrounding the predicted final margin based on the original linear regression model. Interestingly, when we fit Georgia in this way, it looks much more likely that Trump would win through this point than the linear regression model would suggest, though the final margin found by the regression analysis is well within the error bounds suggested from the predictions. Arizona looks up for grabs, indicating that the outlier points were definitely biasing the regression analysis. Pennsylvania is much more firmly leaning towards Biden. We can look at the results again a day later to see how the race evolved:

Clearly, Georgia has swung in Biden's favor over the course of the day. The mean final margin in Pennsylvania has not moved much, though the uncertainty has tightened up and made the result more likely for Biden. Arizona could still go either way.

Animating the Updates

Now that we have built a model, we can build an animation that shows the evolution of the predicted results as a function of time. This will show how the uncertainty shrinks over time as fewer votes remain. I check for results every 30 minutes for the 12 days from 4 November onward, and update the model when new ballots are found. I also compute a Biden win probability and show the mean margin $\pm$ 2 standard deviations to give an idea of the equivalent regression result and its uncertainty.
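Given an array of simulated final margins, these summary quantities take only a few lines to compute. The sketch below uses a hypothetical helper, `summarise_margin`, applied to toy numbers rather than real model output.

```python
import numpy as np

def summarise_margin(margin_samples):
    """Biden win probability and mean +/- 2 standard deviations of the margin."""
    m = np.asarray(margin_samples, dtype=float)
    mean, sd = m.mean(), m.std()
    return {
        # Share of simulated outcomes in which Biden finishes ahead.
        "biden_win_prob": float(np.mean(m > 0)),
        "mean": float(mean),
        "lower": float(mean - 2 * sd),
        "upper": float(mean + 2 * sd),
    }

summary = summarise_margin([-10, 10, 20, 30])
print(summary)
```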

Note: Because new MCMC samples need to be drawn for each new update, creating this animation ends up being fairly expensive to run (this took several hours on my laptop). I speed things up by saving the current prediction each time the MCMC samples are drawn, so that if the previous iteration is the same we do not need to re-run the model. However, this is still fairly expensive, so don't try and run this unless you are willing to wait!

Displaying these, we can see how the race evolves over time.

Based on this model, we can see that Pennsylvania was very clearly going in Biden's direction from early on, despite Trump's substantial lead at the end of Election Day. This was reflected by comments made by other election data journalists, all of whom were fairly confident that the numbers were good news for Biden even as early as 4 November. Biden's win probability steadily increased, surpassing 99% on 6 November. The media called the state, and the election, for Biden on 7 November.

Georgia, on the other hand, was not a sure thing. For much of the early data, our model favored Trump, who had a win probability of 82% on the evening of 4 November. However, the uncertainties were wide enough at that point that Biden's eventual victory was still not an unreasonable outcome. As the ballots shifted towards Biden, we can see a clear change on 5 November, and by that evening Biden's win probability was 70%. Biden's chances steadily increased and surpassed 99% on the evening of 7 November. However, since the final margin was still fairly small in absolute terms, the media did not call Georgia until 12 or 13 November.

Arizona, despite being the first state among these that many news outlets called, showed the largest uncertainties for much of the time period we have data, with no candidate having a clear advantage until 9 November when Biden took a slight lead in the model predictions. From there, Biden inched ahead as the remaining ballots came in, and the outcome shifted clearly in his favor on 12 November with his win probability exceeding 99% that evening. The remaining media outlets called Arizona for Biden on 13 November.

As we can see, our model is able to call the outcome of these states slightly before the media does so (possibly due to some level of conservatism on the part of the media). Seeing the range of uncertainties shrink is helpful to know what range of outcomes could still be reasonably expected, and can be a much more interesting way to visualize the results (particularly when animated as above).

Conclusion

This Data Story examined how we could build a simple Bayesian hierarchical model for the US election data and use it to forecast the final outcome. The model showed how the outcome in three key battleground states evolved over the week following the election as mail ballots were counted, tipping the election in favor of Biden. Because the model includes uncertainties and prior beliefs about voting behavior, this gave a richer picture of how to forecast the final result than simply extrapolating using early returns (with considerably more thought and effort required, however!). Because of the time scales over which the election played out, we could imagine having put this in place to make prospective predictions (stay tuned for 2024!) in real time to see how this simple model aligns with more experienced election forecasters.